Goto

Collaborating Authors

 data exploration


Mind the Gap: A Generative Approach to Interpretable Feature Selection and Extraction

Been Kim, Julie A. Shah, Finale Doshi-Velez

Neural Information Processing Systems

We present the Mind the Gap Model (MGM), an approach for interpretable feature extraction and selection. By placing interpretability criteria directly into the model, we allow for the model to both optimize parameters related to interpretabil-ity and to directly report a global set of distinguishable dimensions to assist with further data exploration and hypothesis generation.


Large Language Models Struggle to Describe the Haystack without Human Help: Human-in-the-loop Evaluation of LLMs

Li, Zongxia, Calvo-Bartolomé, Lorena, Hoyle, Alexander, Xu, Paiheng, Dima, Alden, Fung, Juan Francisco, Boyd-Graber, Jordan

arXiv.org Artificial Intelligence

A common use of NLP is to facilitate the understanding of large document collections, with a shift from using traditional topic models to Large Language Models. Yet the effectiveness of using LLM for large corpus understanding in real-world applications remains under-explored. This study measures the knowledge users acquire with unsupervised, supervised LLM-based exploratory approaches or traditional topic models on two datasets. While LLM-based methods generate more human-readable topics and show higher average win probabilities than traditional models for data exploration, they produce overly generic topics for domain-specific datasets that do not easily allow users to learn much about the documents. Adding human supervision to the LLM generation process improves data exploration by mitigating hallucination and over-genericity but requires greater human effort. In contrast, traditional. models like Latent Dirichlet Allocation (LDA) remain effective for exploration but are less user-friendly. We show that LLMs struggle to describe the haystack of large corpora without human help, particularly domain-specific data, and face scaling and hallucination limitations due to context length constraints. Dataset available at https://huggingface. co/datasets/zli12321/Bills.


Explainable Multi-Modal Data Exploration in Natural Language via LLM Agent

Nooralahzadeh, Farhad, Zhang, Yi, Furst, Jonathan, Stockinger, Kurt

arXiv.org Artificial Intelligence

International enterprises, organizations, or hospitals collect large amounts of multi-modal data stored in databases, text documents, images, and videos. While there has been recent progress in the separate fields of multi-modal data exploration as well as in database systems that automatically translate natural language questions to database query languages, the research challenge of querying database systems combined with other unstructured modalities such as images in natural language is widely unexplored. In this paper, we propose XMODE - a system that enables explainable, multi-modal data exploration in natural language. Our approach is based on the following research contributions: (1) Our system is inspired by a real-world use case that enables users to explore multi-modal information systems. (2) XMODE leverages a LLM-based agentic AI framework to decompose a natural language question into subtasks such as text-to-SQL generation and image analysis. (3) Experimental results on multi-modal datasets over relational data and images demonstrate that our system outperforms state-of-the-art multi-modal exploration systems, excelling not only in accuracy but also in various performance metrics such as query latency, API costs, planning efficiency, and explanation quality, thanks to the more effective utilization of the reasoning capabilities of LLMs.


Automated Question Generation on Tabular Data for Conversational Data Exploration

Chaudhuri, Ritwik, C, Rajmohan, DB, Kirushikesh, Agarwal, Arvind

arXiv.org Artificial Intelligence

Exploratory data analysis (EDA) is an essential step for analyzing a dataset to derive insights. Several EDA techniques have been explored in the literature. Many of them leverage visualizations through various plots. But it is not easy to interpret them for a non-technical user, and producing appropriate visualizations is also tough when there are a large number of columns. Few other works provide a view of some interesting slices of data but it is still difficult for the user to draw relevant insights from them. Of late, conversational data exploration is gaining a lot of traction among non-technical users. It helps the user to explore the dataset without having deep technical knowledge about the data. Towards this, we propose a system that recommends interesting questions in natural language based on relevant slices of a dataset in a conversational setting. Specifically, given a dataset, we pick a select set of interesting columns and identify interesting slices of such columns and column combinations based on few interestingness measures. We use our own fine-tuned variation of a pre-trained language model(T5) to generate natural language questions in a specific manner. We then slot-fill values in the generated questions and rank them for recommendations. We show the utility of our proposed system in a coversational setting with a collection of real datasets.


Mind the Gap: A Generative Approach to Interpretable Feature Selection and Extraction

Neural Information Processing Systems

We present the Mind the Gap Model (MGM), an approach for interpretable feature extraction and selection. By placing interpretability criteria directly into the model, we allow for the model to both optimize parameters related to interpretability and to directly report a global set of distinguishable dimensions to assist with further data exploration and hypothesis generation.


Collaborative business intelligence virtual assistant

Cherednichenko, Olga, Muhammad, Fahad

arXiv.org Artificial Intelligence

The present-day business landscape necessitates novel methodologies that integrate intelligent technologies and tools capable of swiftly providing precise and dependable information for decision-making purposes. Contemporary society is characterized by vast amounts of accumulated data across various domains, which hold considerable potential for informing and guiding decision-making processes. However, these data are typically collected and stored by disparate and unrelated software systems, stored in diverse formats, and offer varying levels of accessibility and security. To address the challenges associated with processing such large volumes of data, organizations often rely on data analysts. Nonetheless, a significant hurdle in harnessing the benefits of accumulated data lies in the lack of direct communication between technical specialists, decision-makers, and business process analysts. To overcome this issue, the application of collaborative business intelligence (CBI) emerges as a viable solution. This research focuses on the applications of data mining and aims to model CBI processes within distributed virtual teams through the interaction of users and a CBI Virtual Assistant. The proposed virtual assistant for CBI endeavors to enhance data exploration accessibility for a wider range of users and streamline the time and effort required for data analysis. The key contributions of this study encompass: 1) a reference model representing collaborative BI, inspired by linguistic theory; 2) an approach that enables the transformation of user queries into executable commands, thereby facilitating their utilization within data exploration software; and 3) the primary workflow of a conversational agent designed for data analytics.


Demonstration of InsightPilot: An LLM-Empowered Automated Data Exploration System

Ma, Pingchuan, Ding, Rui, Wang, Shuai, Han, Shi, Zhang, Dongmei

arXiv.org Artificial Intelligence

Exploring data is crucial in data analysis, as it helps users understand and interpret the data more effectively. However, performing effective data exploration requires in-depth knowledge of the dataset and expertise in data analysis techniques. Not being familiar with either can create obstacles that make the process time-consuming and overwhelming for data analysts. To address this issue, we introduce InsightPilot, an LLM (Large Language Model)-based, automated data exploration system designed to simplify the data exploration process. InsightPilot automatically selects appropriate analysis intents, such as understanding, summarizing, and explaining. Then, these analysis intents are concretized by issuing corresponding intentional queries (IQueries) to create a meaningful and coherent exploration sequence. In brief, an IQuery is an abstraction and automation of data analysis operations, which mimics the approach of data analysts and simplifies the exploration process for users. By employing an LLM to iteratively collaborate with a state-of-the-art insight engine via IQueries, InsightPilot is effective in analyzing real-world datasets, enabling users to gain valuable insights through natural language inquiries. We demonstrate the effectiveness of InsightPilot in a case study, showing how it can help users gain valuable insights from their datasets.


What Should Data Science Education Do with Large Language Models?

Tu, Xinming, Zou, James, Su, Weijie J., Zhang, Linjun

arXiv.org Artificial Intelligence

The rapid advances of large language models (LLMs), such as ChatGPT, are revolutionizing data science and statistics. These state-of-the-art tools can streamline complex processes. As a result, it reshapes the role of data scientists. We argue that LLMs are transforming the responsibilities of data scientists, shifting their focus from hands-on coding, data-wrangling and conducting standard analyses to assessing and managing analyses performed by these automated AIs. This evolution of roles is reminiscent of the transition from a software engineer to a product manager. We illustrate this transition with concrete data science case studies using LLMs in this paper. These developments necessitate a meaningful evolution in data science education. Pedagogy must now place greater emphasis on cultivating diverse skillsets among students, such as LLM-informed creativity, critical thinking, AI-guided programming. LLMs can also play a significant role in the classroom as interactive teaching and learning tools, contributing to personalized education. This paper discusses the opportunities, resources and open challenges for each of these directions. As with any transformative technology, integrating LLMs into education calls for careful consideration. While LLMs can perform repetitive tasks efficiently, it's crucial to remember that their role is to supplement human intelligence and creativity, not to replace it. Therefore, the new era of data science education should balance the benefits of LLMs while fostering complementary human expertise and innovations. In conclusion, the rise of LLMs heralds a transformative period for data science and its education. This paper seeks to shed light on the emerging trends, potential opportunities, and challenges accompanying this paradigm shift, hoping to spark further discourse and investigation into this exciting, uncharted territory.


On the Potential of Artificial Intelligence Chatbots for Data Exploration of Federated Bioinformatics Knowledge Graphs

Sima, Ana-Claudia, de Farias, Tarcisio Mendes

arXiv.org Artificial Intelligence

In this paper, we present work in progress on the role of artificial intelligence (AI) chatbots, such as ChatGPT, in facilitating data access to federated knowledge graphs. In particular, we provide examples from the field of bioinformatics, to illustrate the potential use of Conversational AI to describe datasets, as well as generate and explain (federated) queries across datasets for the benefit of domain experts.


A Reference Model for Collaborative Business Intelligence Virtual Assistants

Cherednichenko, Olga, Muhammad, Fahad, Darmont, Jérôme, Favre, Cécile

arXiv.org Artificial Intelligence

Collaborative Business Analysis (CBA) is a methodology that involves bringing together different stakeholders, including business users, analysts, and technical specialists, to collaboratively analyze data and gain insights into business operations. The primary objective of CBA is to encourage knowledge sharing and collaboration between the different groups involved in business analysis, as this can lead to a more comprehensive understanding of the data and better decision-making. CBA typically involves a range of activities, including data gathering and analysis, brainstorming, problem-solving, decision-making and knowledge sharing. These activities may take place through various channels, such as in-person meetings, virtual collaboration tools or online forums. This paper deals with virtual collaboration tools as an important part of Business Intelligence (BI) platform. Collaborative Business Intelligence (CBI) tools are becoming more user-friendly, accessible, and flexible, allowing users to customize their experience and adapt to their specific needs. The goal of a virtual assistant is to make data exploration more accessible to a wider range of users and to reduce the time and effort required for data analysis. It describes the unified business intelligence semantic model, coupled with a data warehouse and collaborative unit to employ data mining technology. Moreover, we propose a virtual assistant for CBI and a reference model of virtual tools for CBI, which consists of three components: conversational, data exploration and recommendation agents. We believe that the allocation of these three functional tasks allows you to structure the CBI issue and apply relevant and productive models for human-like dialogue, text-to-command transferring, and recommendations simultaneously. The complex approach based on these three points gives the basis for virtual tool for collaboration. CBI encourages people, processes, and technology to enable everyone sharing and leveraging collective expertise, knowledge and data to gain valuable insights for making better decisions. This allows to respond more quickly and effectively to changes in the market or internal operations and improve the progress.